Feat/Introduce AI semantic search! #980

vaclisinc · 2025-11-16T22:06:48Z

Summary

Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service using BGE (BAAI General Embedding) models (based on Jacky’s work from last semester), along with backend proxy endpoints and updated catalog UX, enabling AI course search.

⚠️ IMPORTANT: Remember to add SEMANTIC_SEARCH_URL=http://semantic-search:8000 to your .env!

⚠️ IMPORTANT: Building the index for the first time may take 2-3 minutes. In the future, index building will be triggered automatically when the datapuller runs.
For now, run the command below (adjusting the year and semester as needed) to build the index first!

curl -X POST http://localhost:8000/refresh \
       -H 'Content-Type: application/json' \
       -d '{"year": 2026, "semester": "Spring"}' | jq

System Architecture

flowchart LR

    %% ---------- Frontend ----------
    subgraph Frontend
        FE_UI["Search Bar + AI Search Toggle"]
    end

    %% ---------- Node Backend ----------
    subgraph NodeBackend
        ProxyRouter["/api/semantic-search/*  (proxy router)"]
        CoursesAPI["/api/semantic-search/courses  (lightweight endpoint)"]
        GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
    end

    %% ---------- Python Semantic Service ----------
    subgraph SemanticService["Semantic Search Service (FastAPI)"]
        Health["/health"]
        Refresh["/refresh  (rebuild FAISS index)"]
        Search["/search  (threshold-based semantic query)"]
        BGE["BGE Embedding Model"]
        FAISS["FAISS Index (cosine similarity)"]
    end

    %% ---------- Catalog Data Puller ----------
    subgraph CatalogData
        DataPuller["GraphQL Catalog Datapuller"]
    end

    %% ---------- Data Flow ----------
    FE_UI -->|Search Query| CoursesAPI

    CoursesAPI -->|Forward to Python| Search

    Search -->|Generate Query Embedding| BGE
    Search -->|Vector Similarity Search| FAISS
    FAISS -->|Threshold-filtered Results| Search

    Search --> CoursesAPI --> FE_UI

    %% Index refresh / data ingestion
    DataPuller --> GraphQLResolvers --> |TODO:|Refresh
    Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
    Refresh -->|Generate Embeddings| BGE --> FAISS

Examples

Input: “Memory models in concurrent programming”
→ Should return courses like databases, operating systems, etc.
→ Should not return biology or psychology courses just because of the word “memory.”

Input: “how to shoot a hot vlog”

Implementation Details

Python Semantic Search Service (FastAPI)

FastAPI microservice (apps/semantic-search) that:
- Uses the BGE (BAAI/bge-base-en-v1.5) embedding model, which is optimized for retrieval tasks
- Builds term-specific embeddings + FAISS indices from GraphQL catalog data
- Implements threshold-based filtering (returns all results above the similarity threshold, not just top-k)
- Searches the top 500 candidates for performance, then filters by threshold (default: 0.45)

Key endpoints:
- /health: readiness probe showing index status
- /refresh: rebuild the FAISS index for a given year/semester
- /search: semantic query with threshold filtering

Model Architecture:
- Uses an instruction prefix for queries: "Represent this sentence for searching relevant passages: {query}"
- Course text format: SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}
- FAISS IndexFlatIP with L2-normalized embeddings (cosine similarity)
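The text formatting, query prefix, and candidate-then-threshold filtering listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the service's code: the helper names (format_course, build_query, threshold_search) are hypothetical, the vectors are toy 2-D values rather than BGE embeddings, and a brute-force loop stands in for the FAISS IndexFlatIP lookup.

```python
import math

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def format_course(subj, num, title, desc):
    """Course text in the format quoted above (this is what gets embedded)."""
    return f"SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}"


def build_query(query):
    """Prepend the instruction prefix to a user query."""
    return QUERY_PREFIX + query


def l2_normalize(vec):
    """Scale to unit length so the inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def threshold_search(query_vec, course_vecs, threshold=0.45, max_candidates=500):
    """Score all courses, keep the top candidates, then drop low scores."""
    q = l2_normalize(query_vec)
    scored = []
    for course_id, vec in course_vecs.items():
        score = sum(a * b for a, b in zip(q, l2_normalize(vec)))
        scored.append((course_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    candidates = scored[:max_candidates]
    return [(cid, s) for cid, s in candidates if s > threshold]


# Toy vectors, purely illustrative (not real embeddings).
toy_index = {
    "COMPSCI 285": [0.9, 0.1],
    "COMPSCI 162": [0.6, 0.8],
    "PSYCH 1": [-0.2, 1.0],
}
hits = threshold_search([1.0, 0.0], toy_index)
print(hits)
```

Because the number of hits depends on the scores rather than a fixed k, a narrow query may return a handful of courses and a broad one many more, up to the candidate cap.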

Example: manually refreshing an index

curl -X POST http://localhost:8000/refresh \
     -H 'Content-Type: application/json' \
     -d '{"year": 2026, "semester": "Spring"}' | jq

Example: running a semantic search

# Threshold-based search (returns all courses with similarity > 0.45)
curl "http://localhost:8000/search?query=deep%20reinforcement%20learning&year=2026&semester=Spring&threshold=0.45" | jq

# Response includes similarity scores for ranking
{
  "query": "deep reinforcement learning",
  "threshold": 0.45,
  "count": 12,
  "results": [
    {
      "subject": "COMPSCI",
      "courseNumber": "285",
      "score": 0.713,
      "title": "Deep Reinforcement Learning, Decision Making, and Control"
    },
    ...
  ]
}
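For scripting against the service, the same request can be issued from Python with only the standard library. This is a sketch that assumes the service is reachable at http://localhost:8000 (as in the curl examples) and that the response has the shape shown above; build_search_url and search are hypothetical helper names.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # same host/port as the curl examples


def build_search_url(query, year, semester, threshold=0.45, base_url=BASE_URL):
    """Build the /search URL with the parameters used in the curl example."""
    params = {
        "query": query,
        "year": year,
        "semester": semester,
        "threshold": threshold,
    }
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return f"{base_url}/search?{urlencode(params, quote_via=quote)}"


def search(query, year, semester, threshold=0.45):
    """Call the service and return (subject, courseNumber, score) tuples."""
    with urlopen(build_search_url(query, year, semester, threshold)) as resp:
        payload = json.load(resp)
    return [(r["subject"], r["courseNumber"], r["score"]) for r in payload["results"]]


url = build_search_url("deep reinforcement learning", 2026, "Spring")
print(url)
```

Only build_search_url runs here; search needs a running service with a built index.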

Backend Integration (Node / Express)

Added SEMANTIC_SEARCH_URL environment variable pointing to Python service
Implemented lightweight proxy endpoint /api/semantic-search/courses:
- Forwards requests to Python service
- Returns only {subject, courseNumber, score} for efficient frontend filtering
- Frontend maintains API response order (sorted by semantic similarity)
Updated GraphQL behavior:
- Introduced hasCatalogData field for term filtering
- Updated resolver to use terms(withCatalogData: true)
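The field trimming done by the lightweight endpoint amounts to a projection over the Python service's response. The actual backend is Node/Express, so the Python below only illustrates the data shaping; slim_results is a hypothetical name, and the sample payload reuses the response shown earlier.

```python
def slim_results(python_response):
    """Keep only the fields the frontend needs, preserving similarity order."""
    return [
        {
            "subject": item["subject"],
            "courseNumber": item["courseNumber"],
            "score": item["score"],
        }
        for item in python_response.get("results", [])
    ]


upstream = {
    "query": "deep reinforcement learning",
    "threshold": 0.45,
    "count": 1,
    "results": [
        {
            "subject": "COMPSCI",
            "courseNumber": "285",
            "score": 0.713,
            "title": "Deep Reinforcement Learning, Decision Making, and Control",
        }
    ],
}
print(slim_results(upstream))
```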

Frontend (Catalog UI)

AI Search toggle (✨ sparkle button) to activate semantic search mode
Semantic results preserve backend ordering (by similarity score)
Frontend maps semantic results to full course objects for display
Graceful fallback to fuzzy search when semantic search unavailable
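The mapping and fallback behaviour can be sketched as follows. The real frontend is TypeScript, so this Python version is only an illustration of the logic; every function name is hypothetical, and the second catalog entry is a made-up placeholder course.

```python
def order_by_semantic_results(semantic_results, catalog):
    """Map (subject, courseNumber) hits to full course objects, keeping hit order."""
    by_key = {(c["subject"], c["courseNumber"]): c for c in catalog}
    ordered = []
    for hit in semantic_results:
        course = by_key.get((hit["subject"], hit["courseNumber"]))
        if course is not None:  # skip hits missing from the loaded catalog
            ordered.append(course)
    return ordered


def search_courses(query, catalog, semantic_search, fuzzy_search):
    """Try semantic search first; fall back to fuzzy search if it fails."""
    try:
        hits = semantic_search(query)
    except Exception:  # e.g. the semantic service is down or unreachable
        return fuzzy_search(query, catalog)
    return order_by_semantic_results(hits, catalog)


catalog = [
    {"subject": "EXAMPLE", "courseNumber": "1", "title": "Placeholder course"},
    {"subject": "COMPSCI", "courseNumber": "285",
     "title": "Deep Reinforcement Learning, Decision Making, and Control"},
]


def toy_semantic_search(query):
    # Stand-in for the backend call; returns hits already sorted by score.
    return [
        {"subject": "COMPSCI", "courseNumber": "285", "score": 0.713},
        {"subject": "EXAMPLE", "courseNumber": "1", "score": 0.5},
    ]


def broken_semantic_search(query):
    raise ConnectionError("semantic search unavailable")


def toy_fuzzy_search(query, catalog):
    # Stand-in for the existing fuzzy search: plain substring match on titles.
    return [c for c in catalog if query.lower() in c["title"].lower()]


ranked = search_courses("rl", catalog, toy_semantic_search, toy_fuzzy_search)
fallback = search_courses("placeholder", catalog, broken_semantic_search, toy_fuzzy_search)
print([c["courseNumber"] for c in ranked], [c["courseNumber"] for c in fallback])
```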

Technical Decisions

Why BGE over other models?

BGE (BAAI General Embedding) is specifically optimized for retrieval tasks
Better semantic understanding than general-purpose models (all-MiniLM, mpnet)
Supports instruction prefixes for improved query understanding
109M parameters - good balance of accuracy and speed

Why threshold instead of top-k?

Threshold-based filtering returns all relevant results, not arbitrary top-k
More flexible - can return 5 results for specific queries, 50 for broad queries
Similarity score threshold (0.45) ensures quality over quantity
Searches top 500 candidates for performance, then applies threshold

Model Options Available (hardcoded in `main.py`)

# Current: BAAI/bge-base-en-v1.5 (best for retrieval)
# Alternatives:
#   BAAI/bge-small-en-v1.5       (faster, 33M params)
#   BAAI/bge-large-en-v1.5       (most accurate, 335M params)
#   all-mpnet-base-v2            (general purpose, 110M params)
#   all-MiniLM-L6-v2             (fastest, 22M params)

Next Steps

Datapuller Integration (TOP PRIORITY!)
- Automatically trigger the /refresh endpoint when new catalog data is pulled
Fine-tuning for Berkeley Courses
- Collect a user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search
Query Expansion
- Handle abbreviations (NLP → Natural Language Processing) and synonyms
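Query expansion is not implemented in this PR. One possible shape for the abbreviation part is a lookup table applied before the query is embedded. In this sketch, expand_query and the table are hypothetical, and only the NLP entry comes from the description above; the other entries are illustrative guesses.

```python
import re

# Only "nlp" is taken from the PR text; the rest are illustrative.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "ml": "machine learning",
    "os": "operating systems",
}


def expand_query(query, table=ABBREVIATIONS):
    """Append the expansion after each known abbreviation, keeping the original."""
    def replace(match):
        word = match.group(0)
        expansion = table.get(word.lower())
        return f"{word} ({expansion})" if expansion else word

    return re.sub(r"[A-Za-z]+", replace, query)


expanded = expand_query("intro to NLP")
print(expanded)
```

Keeping the original token alongside the expansion means an exact match on the abbreviation is not lost if the table entry turns out to be wrong for a given query.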

Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND

maxmwang · 2025-11-17T04:31:33Z

Will review later, but this will need infra changes before being merged.

vaclisinc · 2025-11-18T02:31:22Z

Hey guys! I've implemented semantic search on Berkeleytime using BGE embeddings. It already works pretty well, but I’d like to fine-tune it specifically for Berkeley courses. 🎯

I need your help building a small training dataset.

Please actually try some searches on Berkeleytime (using the new semantic search), and whenever you see results that look clearly wrong or surprising, send me an example in this format:

{
  "query": "planning about my career",
  "good_results": ["MBA 209P", "MCELLBI 295"],
  "bad_results": ["LDARCH 205", "CYPLAN 116", "CYPLAN 208"],
  "missing_courses": ["IAFIRCAM 198BC", "ARCH 198BC", "MUSIC 198BC", "COMLIT 198BC"] // optional
}

Where:

query = what you searched for (it can be a full sentence, not just keywords)
good_results = courses from the search results that ARE relevant (should rank high)
bad_results = courses from the search results that feel clearly NOT related
missing_courses (optional) = courses you strongly expected to see but that did NOT show up at all
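Collected examples could be sanity-checked before they go into a training set. The sketch below is one possible validator, with validate_feedback as a hypothetical name. Note that standard JSON does not allow comments, so the "// optional" annotation in the template above has to be removed before an entry is parsed with json.loads.

```python
def validate_feedback(entry):
    """Return a list of problems found in one feedback entry (empty means OK)."""
    problems = []
    query = entry.get("query")
    if not isinstance(query, str) or not query.strip():
        problems.append("query must be a non-empty string")

    lists = {}
    fields = (("good_results", True), ("bad_results", True), ("missing_courses", False))
    for key, required in fields:
        if key not in entry:
            if required:
                problems.append(f"{key} is required")
            continue
        value = entry[key]
        if isinstance(value, list) and all(isinstance(c, str) for c in value):
            lists[key] = value
        else:
            problems.append(f"{key} must be a list of course strings")

    # A course cannot be both a good and a bad result for the same query.
    overlap = set(lists.get("good_results", [])) & set(lists.get("bad_results", []))
    if overlap:
        problems.append(f"listed as both good and bad: {sorted(overlap)}")
    return problems


example = {
    "query": "planning about my career",
    "good_results": ["MBA 209P", "MCELLBI 295"],
    "bad_results": ["LDARCH 205", "CYPLAN 116", "CYPLAN 208"],
    "missing_courses": ["IAFIRCAM 198BC", "ARCH 198BC", "MUSIC 198BC", "COMLIT 198BC"],
}
print(validate_feedback(example))
```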

What I’m especially looking for:

Natural language queries like:
- “planning about my career”
- “I want to get into AI research from a non-CS background”
- “I like math but hate proofs, what should I take?”
Any query where the results feel clearly off, noisy, or surprising

Goal: ~50–100 examples total.
Even 3–5 examples from you would be super helpful. 🙏

* disable sections/lectures, scroll hide bug, event clear bug, calendar month bug, leftborder color bug
* fix: hardcode color for sidebar header
* fix: urgent bug of cannot adding a class which is not in primarySection (access _class before initialization)
* fix: minor format error
---------
Co-authored-by: vaclis.mbp <[email protected]>

Restores courseId, course.subject, and course.number fields to GET_CANONICAL_CATALOG_QUERY that were removed in d0a37f1.

These fields are essential for cross-listed course functionality:
- courseId: Required by Class.course resolver (class/resolver.ts:112) to fetch the parent course record for cross-listed courses like DATA C100 / STAT C100 which share courseId but have different subjects
- course.subject & course.number: Required by the resolver override (class/resolver.ts:118-119) and grade lookup key generation (catalog/controller.ts:320) to ensure each cross-listed variant displays the correct department name and historical grade distribution

Without these fields:
- Cross-listed courses cannot be identified (no courseId linking)
- Grade distributions show incorrect data (wrong lookup keys)
- Course metadata displays wrong department (no subject override)

Database verification shows courseId '148047' links DATA C100 and STAT C100 across multiple terms, confirming the need for these fields.

Note: termId/sessionId not restored as catalog controller pre-populates enrollment data, making those fields unnecessary for this query.

…isting

The catalog controller was not overriding course.subject and course.number with class-specific values, causing cross-listed courses to be indexed and displayed with incorrect department names.

Issue:
- DATA C100 and STAT C100 share courseId '148047'
- The parent course record has subject: 'STAT', number: 'C100'
- When catalog fetches DATA C100, it was using the parent course's subject
- This caused getIndex() (line 33) to index DATA C100 as 'STAT C100'
- Search results would show wrong department for cross-listed variants

Fix: Added override logic after formatCourse() to set:
- formattedCourse.subject = _class.subject (DATA, not STAT)
- formattedCourse.number = _class.courseNumber (C100)

This matches the behavior in Class.course resolver (class/resolver.ts:118-119) and ensures consistency across all code paths that access course metadata.

Benefits:
- Search indexing uses correct subject for each cross-listed variant
- Course metadata displays correct department in all contexts
- Behavior now consistent between catalog controller and GraphQL resolver

* pill v1
* finished section
* small fixes
* Update apps/frontend/src/components/Class/Sections/Sections.module.scss Co-authored-by: Copilot <[email protected]>
* copilot fixes
* accessible table
* location support
---------
Co-authored-by: Copilot <[email protected]>

vaclisinc requested review from ARtheboss, PineND and maxmwang November 16, 2025 22:06

github-code-quality bot found potential problems Nov 16, 2025

apps/frontend/src/components/ClassBrowser/index.tsx (marked fixed)

vaclisinc self-assigned this Nov 16, 2025

vaclisinc marked this pull request as draft November 16, 2025 22:17

vaclisinc marked this pull request as ready for review November 16, 2025 22:21

vaclisinc force-pushed the feat/semantic-search-vaclis branch from 4fa576b to a4db7f3 Compare November 16, 2025 23:12

vaclisinc had a problem deploying to development November 16, 2025 23:15 — with GitHub Actions Failure

vaclisinc added 7 commits November 17, 2025 13:07

feat: add semantic search service container

ea1e69e

feat: proxy semantic search through backend

28ced4f

feat: add natural language catalog search UI

86aac3e

fix: format

691d0ff

fix: remove hardcode AI_KEYWORDS and switching from top-k to threshold based result

4098304

fix: use the score from transformer model to sort and change from top-k to threshold=4.5

171a34d

fix: minor format

031bb9e

vaclisinc force-pushed the feat/semantic-search-vaclis branch from a4db7f3 to 031bb9e Compare November 18, 2025 00:39

fix: changing the model in python instead of modifing the infra

0fe511c

vaclisinc changed the title ~~Feat/Introduce semantic search!~~ Feat/Introduce AI semantic search! Nov 18, 2025

PineND added the Catalog pod label Nov 24, 2025

vaclisinc and others added 8 commits December 2, 2025 19:01

fix: modify to a more fancy search bar name - AI-Native Course Search!

f12fdf0

debugging

5f99abe

fix: seatReservationTypes missing problem

0ba3917

border for chips (#984)

72e0c5f

Feat/improved catalog puller (#982)

93fd473

* hasCatalogItem true by default
* classes datapuller populate terms
* Avoid N+1 enrollment fetches in getCatalog
* clean up

reserved-seating badge, no popup

1ba847c

hasReservedSeating for catalog

c020a39

PineND and others added 26 commits December 2, 2025 19:01

improved tooltip for reserved seating

570a4f4

reserved seating text

b4be2f6

reservedSeatingMaxCount to replace hasReservedSeating

e3e7bef

drop down

9fea663

appearance improvements

484fde6

smart link detection

dd08334

remove console logs

10c2c91

simplify Class Notes

530f27b

small refactor, seatReservationTypes populating migration (#989)

5355e44

Feat/dynamically determined sky (#983)

e61e304

* reassign color to fit timeframe
* dynamically determined time
* rename csv
* only sunrise/sunset, daytime, nightime
---------
Co-authored-by: maxmwang <[email protected]>

Revert "fix: override course subject/number in catalog controller for cross-listing"

0524ebb

This reverts commit d87931b.

Revert "fix: restore critical fields for cross-listed courses in catalog query"

9e622a9

This reverts commit 8ec4795.

removed old getIndex (server side fuzzy search implementation

c8e7464

addressed comment

9f0533e

removed deprecated pinning functionality

7a70c7d

make typescript happy

09d537d

remove prereq display if no prereq

449ab89

skeleton loading for catalog (#992)

6d0db8c

Feat/enrollment support catalog (#986)

a2a1f8a

* feat: adding enrollment tab into catalog
* fix: minor format
* Make enrollment graph same as the Enrollment page
* linting
* styling improvements
---------
Co-authored-by: PineND <[email protected]>

fix: Cannot return null for non-nullable field Class.primarySection bug (#993)

714d4d7

feat: infra k8s support and refresh sync with datapuller

507ed8b

fix: minor format

e70ea61

fix: temporary remove test files

94d9609

vaclisinc force-pushed the feat/semantic-search-vaclis branch from 3e0ecce to 94d9609 Compare December 3, 2025 03:10

github-code-quality bot found potential problems Dec 3, 2025

apps/frontend/src/app/Enrollment/CourseManager/CourseInput/index.tsx

      semester: semester as Semester,
-      sectionNumber: selectedClass.primarySection.number,
+      sectionNumber:
+        selectedClass === null

vaclisinc marked this pull request as draft December 3, 2025 03:12

vaclisinc closed this Dec 3, 2025
